Incorporating Nesterov Momentum into Adam
Abstract
When attempting to improve the performance of a deep learning system, there are roughly three approaches one can take. The first is to improve the structure of the model, perhaps by adding another layer, switching from simple recurrent units to LSTM cells [4], or, in the realm of NLP, taking advantage of syntactic parses (e.g. as in [13, et seq.]). Another approach is to improve the initialization of the model, guaranteeing that the early-stage gradients have certain beneficial properties [3], building in large amounts of sparsity [6], or taking advantage of principles of linear algebra [15]. The final approach is to try a more powerful learning algorithm, such as including a decaying sum over the previous gradients in the update [12], dividing each parameter update by the L2 norm of the previous updates for that parameter [2], or even forgoing first-order algorithms for more powerful but more computationally costly second-order algorithms [9]. This paper pursues the third option: improving the quality of the final solution by using a faster, more powerful learning algorithm.
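To make the third option concrete, the sketch below combines the two first-order ingredients mentioned above (a decaying sum over past gradients and division by a running norm of past gradients) into a simplified Adam-style step with a Nesterov-flavoured momentum term, which is the general idea the title refers to. The function name, hyperparameter values, and the use of a constant momentum coefficient instead of a momentum schedule are this editor's illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def nadam_step(theta, g, m, v, t, alpha=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Nesterov-style Adam update (constant beta1 for clarity).
    Returns the new parameters and the updated moment estimates."""
    # Decaying sum (exponential moving average) of past gradients: momentum.
    m = beta1 * m + (1 - beta1) * g
    # Decaying sum of past squared gradients: the RMS denominator.
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias-corrected estimates (t counts steps starting from 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov "look-ahead": mix the corrected momentum with the current
    # gradient; using m_hat alone here would give a plain Adam step.
    m_bar = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
    theta = theta - alpha * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v
```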
Similar resources
Don't Decay the Learning Rate, Increase the Batch Size
It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but wi...
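As a rough illustration of the recipe described in this abstract, here is a minimal training-loop sketch in which the batch size is multiplied at the epochs where a learning-rate schedule would normally divide the step size. The milestone epochs, growth factor, and the grad_fn callback are hypothetical choices for illustration, not settings taken from the paper.

```python
import numpy as np

def train(X, y, grad_fn, epochs=90, lr=0.1, batch_size=128,
          milestones=(30, 60, 80), growth=5):
    """Keep the learning rate fixed and grow the batch size instead of
    decaying the learning rate at each milestone epoch (illustrative values).
    grad_fn(theta, X_batch, y_batch) must return the minibatch gradient."""
    theta = np.zeros(X.shape[1])
    for epoch in range(epochs):
        if epoch in milestones:
            batch_size *= growth          # in place of: lr /= growth
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            theta -= lr * grad_fn(theta, X[batch], y[batch])
    return theta
```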
Acceleration of Gradient-based Path Integral Method for Efficient Optimal and Inverse Optimal Control
This paper deals with a new accelerated path integral method, which iteratively searches for optimal controls within a small number of iterations. This study is based on the recent observation that a path integral method for reinforcement learning can be interpreted as gradient descent. This observation also applies to an iterative path integral method for optimal control, which sets a convincing ar...
Online Learning Rate Adaptation with Hypergradient Descent
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a range of optimization problems by applying it to stochastic gradient descent, stochastic gradient descent with Nesterov momentum, and Adam, showing that it significantly reduces the need for the man...
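A minimal sketch of this idea for the plain-SGD case follows. Because one SGD step is theta_t = theta_{t-1} - alpha * grad f(theta_{t-1}), the derivative of the loss with respect to alpha is the negative dot product of the current and previous gradients, so the learning rate can itself be adjusted by gradient descent. The function name, the hyper-learning-rate beta, and the fixed step count are assumptions made for illustration.

```python
import numpy as np

def sgd_hd(theta, grad_fn, steps=1000, alpha=0.001, beta=1e-4):
    """SGD whose learning rate alpha is adapted online: alpha grows while
    successive gradients point in similar directions and shrinks when they
    oppose each other.  theta is a 1-D parameter vector; grad_fn(theta)
    returns the gradient at theta."""
    g_prev = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        alpha = alpha + beta * np.dot(g, g_prev)  # hypergradient step on alpha
        theta = theta - alpha * g                 # ordinary SGD step
        g_prev = g
    return theta, alpha
```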
Three-cocycles, Nonassociative Gauge Transformations and Dirac's Monopole
In 1931 Dirac [1] introduced a magnetic monopole into quantum mechanics and found a quantization relation between an electric charge e and a magnetic charge q, 2μ = n, n ∈ Z, where μ = eq and ℏ = c = 1. One of the widely accepted proofs of the Dirac selection rule is based on group representation theory (see, for example, [2, 3, 4, 5, 6]). In the presence of the magnetic monopole the operato...
Improved Stochastic gradient descent algorithm for SVM
In order to improve the efficiency and classification ability of support vector machines (SVM) based on the stochastic gradient descent algorithm, three improved stochastic gradient descent (SGD) algorithms, namely Momentum, Nesterov accelerated gradient (NAG), and RMSprop, are used to solve the support vector machine. The experimental results show that the algorithm based on RMSprop for solving the l...
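Since the abstract is cut off, the following is only a guess at the general setup it describes: a linear SVM trained by a stochastic subgradient method whose per-coordinate step sizes come from RMSprop (one of the three variants named above). The function name, regularization constant, and other hyperparameters are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def svm_rmsprop(X, y, lam=1e-3, lr=0.01, rho=0.9, eps=1e-8, epochs=10):
    """Stochastic subgradient training of a linear SVM (hinge loss plus L2
    regularization) with RMSprop scaling, i.e. each coordinate's step is
    divided by the root of a running average of squared subgradients.
    Labels y must be +1/-1."""
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            margin = y[i] * X[i].dot(w)
            # Subgradient of lam/2*||w||^2 + max(0, 1 - y*w.x).
            g = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            v = rho * v + (1 - rho) * g ** 2
            w -= lr * g / (np.sqrt(v) + eps)
    return w
```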